emLam - a Hungarian Language Modeling baseline

نویسنده

Dávid Márk Nemeskey

چکیده

This paper aims to make up for the lack of documented baselines for Hungarian language modeling. Various approaches are evaluated on three publicly available Hungarian corpora. Perplexity values comparable to models of similar-sized English corpora are reported. A new, freely downloadable Hungarian benchmark corpus is introduced.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Finite-state Transducer Base with Explicit Modeling of Ph

This article describes the design and the experimental evaluation of the first Hungarian large vocabulary continuous speech recognition (LVCSR) system. The architecture of the recognition system is based on the recently proposed weighted finite state transducer (WFST) paradigm. The task domain is the recognition of fluently read sentences selected from a major daily newspaper. Recognition perfo...

متن کامل

Explicit Modeling of Phonological Changes in Finite-state Transducer Based Hungarian Lvcsr

This article describes the operation and the experimental evaluation of the pronunciation modeling component of the first Hungarian large vocabulary continuous speech recognition system. The proposed method is based on the implementation of context dependent rewrite rules by weighted finite state transducers (WFSTs). The proposed phonological model decreases the error rate by 8.32% relatively c...

متن کامل

Finite-state transducer based hungarian LVCSR with explicit modeling of phonological changes

متن کامل

Finite-state Transducer Based Phonology and Morphology Modeling with Applications to Hungarian Lvcsr

This article introduces a novel approach to model phonology and morphosyntax in morpheme unit based speech recognizers. The proposed method is evaluated in our recent Hungarian large vocabulary continuous speech recognition (LVCSR) system. The architecture of the recognition system is based on the weighted finite state transducer (WFST) paradigm. The task domain is the recognition of fluently r...

متن کامل

Towards Automatic Transcription of Large Spoken Archives in Agglutinating Languages - Hungarian ASR for the MALACH Project

The paper describes automatic speech recognition experiments and results on the spontaneous Hungarian MALACH speech corpus. A novel morph-based lexical modeling approach is compared to the traditional wordbased one and to another, previously best performing morph-based one in terms of word and letter error rates. The applied language and acoustic modeling techniques are also detailed. Using uns...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

CoRR

دوره abs/1701.07880 شماره

صفحات -

تاریخ انتشار 2017

emLam - a Hungarian Language Modeling baseline

نویسنده

چکیده

منابع مشابه

Finite-state Transducer Base with Explicit Modeling of Ph

Explicit Modeling of Phonological Changes in Finite-state Transducer Based Hungarian Lvcsr

Finite-state transducer based hungarian LVCSR with explicit modeling of phonological changes

Finite-state Transducer Based Phonology and Morphology Modeling with Applications to Hungarian Lvcsr

Towards Automatic Transcription of Large Spoken Archives in Agglutinating Languages - Hungarian ASR for the MALACH Project

عنوان ژورنال:

اشتراک گذاری